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Finite sample size corrections to the reparametrization-invariant solution of the inverse problem 
of probability are computed, and shown to converge uniformly to the correct distribution. 



I. INTRODUCTION 



On 



A basic problem in statistics, with applications in divers fields, is the determination of the probability distribution 
that underlies a finite set of experimental results involving continuous variables. From the practical point of view, 
one usually has a definite finite-parameter model for the probability distribution on theoretical grounds, and the 
parameters are fixed by fitting to the data set. As is obvious, this parametric inference introduces a theoretical bias, 
since the true distribution may not even lie in the parameter set chosen. It is therefore of interest to attempt a direct 
determination of the probability distribution without resorting to finite-dimensional approximations. Of course, for 
^ | any finite data set, one obtains only a probabilistic description of the probability distribution, but the spread of 
possible probability distributions decreases as the size of the data set is increased |l| . 

This set-up is fundamental in pattern theory, a context in which reparametrization invariance is of great importance. 
In visual information processing algorithms, for example, if the inferred pattern depends on, say, the orientation of the 
object, then the object recognition capabilities of the algorithm are not likely to be efficient. In speech recognition, 
variations in the tempo of speech should, again, be treated as reparametrizations of the same underlying pattern, as 
should frequency variations. 

I ' In the Bayesian approach to the problem, one starts by assuming a space of probability distributions, {w}, within 
which the true distribution lies, and given a set of observations, /, one obtains a probabilistic description of the true 
' distribution by using 

O- w m prob(/|w) • prob(w) 

Pr°bH/) = — ^ ; (1) 

On ; 

""^j^, in words, the probability of the distribution w being the true distribution, given /, is the probability of / given 
■ the distribution w, multiplied by the probability of w occuring in the space of all distributions, normalized by the 
probability of / occuring in any of the distributions in the set {w}. The maximum likelihood (ML) estimate of w is 
then that distribution w that maximizes prob(/|iw) • prob(w;). To go further in this direction, one needs to figure out 
what the a priori probability prob(w) should be, and how to compute the ML estimate. Clearly, to avoid bias, one 
""O wants the space of distributions to be as general as possible, consistent with computability. 

In one dimension, Bialek, Callan and Strong Q have recently given an elegant formulation of this problem for 
^ , continuous distributions in one dimension. They used Bayes' rule to write the probability of the probability distribution 
1 Q, given the data {xt}, as 

prn , v, i _ P[x u x 2 ,...,x N \Q(x)]P[Q(x)] _ Q(x 1 )Q(x 2 )---Q(x N )P[Q(x)] 

[yW|Sl,a!j ^ J P( Xl ,x 2 ,...,x N ) J[dQ(x)}Q(x 1 )Q(x 2 )---Q(x N )P[Q(x)Y [Z) 

where the factors Q{xi) arise because each Xi is chosen independently from the distribution Q(x), and P[Q] encodes 
the a priori hypotheses about the form of Q. The optimal least-square estimate of Q, Qest^x, {xi}), is then 

n ( r u _ (Q(x)Q(xi)Q(x2)---Q(x N ))W 

y ° sti ' i4|j ~ (Q(x 1 )Q(x 2 )---Q(x N ))W ' W 

where (• • >)W denotes an expectation value with respect to the a priori distribution P[Q(x)]. 

In this field-theoretic setting, Ref. j| assumed that the prior distribution P[Q] should penalize large gradients, so 
written in terms of an unconstrained variable cj) = \n(tQ) G (— oo, +oo), they assumed 



Pe[<f>(x)] = -^exp 



x 5 



(4) 



with I a global length parameter that they later fixed by means of other considerations. This form for the prior 
distribution is very simple, and quite minimal in terms of underlying assumptions, so is almost an ideal example of 
the Bayesian set-up. 
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There is, however, an important aspect in which this prior distribution does not measure up — a probability distri- 
bution is a density, and hence transforms in a specific manner under reparametrizations of the data, e.g. given a set 
of TV observations Xi, we want to infer a distribution w x that is related in a very simple way to the distribution w cxp 
inferred from exp(xi) : 

w x (z) = exp(z) • u> oxp (exp(z)). (5) 

This reparametrization covariance of the estimated distribution turns out to be a strong constraint on the distribution 
prob(w), and on the actual computation of the estimated distribution. In recent work, I found a geometric modification 
of the approach of Ref. |Q, which satisfies this desired reparametrization covariance pL It is the details of this 
solution, especially the corrections for finite sample size (which are obviously of paramount importance for practical 
applications), that are explained in the present paper. 

This paper is organized as follows: Section 2 contains a short review of Ref. 0, to keep the presentation here 
relatively self-contained. In Section 3, the one-dimensional solution for finite sample sizes is given, along with a 
discussion of corrections arising from fluctuations about the saddlepoint distribution. In section 4, I present some 
conclusions, with comments on the solution of the inference problem in dimensions larger than 3. 



II. REVIEW 



Instead of the form (eq. used in Ref. O, I write Q(x) = \Jh(x) exp(<f>(x)), with h(x) a metric in one dimension. 
Hence y/h(x) is a scalar density, and so is Q, which is crucial for a reparametrization covariant solution. An intuition 
for the role that h plays is that it is a local analogue of £, determining binning intervals locally. Set 



P[4>, h} = — exp 



dx^/h(x) (d x (, 



(6) 



The inverse of the metric is just l/h{x) and the reparametrization invariant volume element is y // h(x)dx, so P[4>,h] 
is reparametrization invariant. Now, we want to evaluate 



(Q(x 1 )Q(x 2 )---Q(x N ))^ 



D(j> 



Dh 
Diff+ 
dX 



N 



Z / 2tt 



Dcj> 



Dh 

Diff-i 



exp [— 5(0, h; A)] 



where 



1 f -l r N 1 

5(0, h; A) = - / dx^/hjx) {d x <t>) 2 + i\ / dxy/Hx)^ - Yftfa) + o - *A 



(7) 
(8) 

(9) 



Notice that the integral over all metrics has been divided by the volume of the group of orientation preserving 
diffeomorphisms. In one dimension, this division eliminates all but one global degree of freedom from the metric. 
There is no operational way to distinguish between the factor yjh(xi) and exp(0(xj)) — these must occur together in 
Q(xi). Taking the local symmetry into account, the number of local degrees of freedom is the same as in the approach 
of Bialek et al. |§. 

The equations of motion that follow from varying S are 



■htf) 2 \ + *acx P (0) - J2 -7=r^ s ( x - *i) = o 
(: 



2 h 
J dxV~hexp((j)) = 1 

where I have used primes to denote d/dx. S(x) denotes the scalar density such that J dxS(x) = 1. Introduce a variable 



(10) 

(11) 
(12) 
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y(x) = / y/h(s)ds 



then it follows from eq. [lO] and eq. [ll] that 



f0 
dy 2 



dy 



(13) 



(14) 



Note that the use of eq. is limited by the non-singularity of yh. 

It is now necessary to be careful about the limits of integration. This care in the boundary terms is unnecessary at 
N = oo, in accord with the fact that any finite data set will not indicate the true limits of the probability distribution. 
Suppose that x ranges from x_ to x+, then integrating eq. [H]l find 



N-i\ + 



dx(-i=0')' = 



so, using eq. pJL it follows that 



cxp(-0) = (y+ - y-) 



(y - c) 5 



(y- - c)(y+ - c) 



(15) 



(16) 



where c is an arbitrary constant of integration, c is restricted only by c > y+ > y_, or y + > y_ > c, since exp(<f>) 
must be positive. This 'cyclic' constraint on y +1 y_, c arises because the algebra of infinitesimal reparametrizations of 
the line has a subalgebra isomorphic: to sl(2, R). 
In order to determine y(x), which satisfies 



(y+ - c)(y - y-) 

{y+ -y-)(y-c) 



dx' 



(17) 



observe that the cross ratio on the left is projectively invariant, i.e., invariant under transformations of the form 

az + 8 

z ^ — ZT' 

7z + o 



(18) 



with a, 8, 7,(5 real, a8 — 8j — 1. This amounts to a three-parameter family of equivalent solutions y(x) determined 
by the data. This projective invariance can be fixed by setting c = oo, y+ — 1 and y_ = 0, which implies (f> = 0. 

Operationally, this means that the next measured data point will be observed with equal probability at any value 
of y in the interval [0, 1]. An important point to notice about this solution is that it is only valid at N — oo. At finite 
N, the metric determined by this solution is not smooth, being a sum over S functions, hence we must take more care 
in solving eq.'s pl3|,pl]Jl^ for finite N. Indeed, the action evaluated at this solution is infinite for finite N. 



III. FINITE TV CORRECTIONS 



Obviously, for any practical application of this theory, one needs a systematic scheme for computing finite sample 
size corrections. To understand the theory at finite N, and to make contact with the higher-dimensional theory 
commented on in Section 4, we rewrite eq. as 



S 



dy{d y (/>) 2 +iX[ dye 



1 



TV 



dyP y {y)<P(y) - / dxP x In y/ h(x) 



(19) 



where y{x) is as in eq. |l3|, P x {x) = Y] 8{x — xi), and P y (y)y' = P x (x). It is clear in this form that there are two 
independent functional variations of S, one with respect to <p(y), and the other with respect to y(x) : 



-d 2 v (jy + iXeM^)~ NP v {y) = 



Px(x) 
y'{x) 



. 



(20) 
(21) 
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For N large, suppose P x (x) — P{x)+^j=p{x), with P(x) a continuous density and p{x) : (p(x)p(x')) — P(x)S(x—x'), 

following Ref. J2j. Now, a solution to eq.'s ^ and |l| at leading order in N is fa = 0, and y/h(x) = d x yo(x) = P(x), 
as reviewed in Sect. 2. Indeed, it would seem from eq. [l?] that this solution is exact, and no modification for finite N 
is necessary. This is not true, since for finite sample sizes eq. [f^ defines a singular metric, and hence is inconsistent 
with our calculations which have assumed that the metric is non-singular implicitly. 

We can, however, solve eq. [H] to get 4> = <j) \/h, P x , and then use this solution in eq. [w] to determine h. We shall 

find that the leading correction at finite N is a correction to <f> at 0{N~ X / 2 ), with no correction to y/h = P(x). This 
ordering of the computational steps parallels the calculation in , with the determination of h here the analogue of 
the determination of the length scale in What is different in the present approach, is that the formulae derived 
here are reparametrization-covariant, e.g., square roots of densities do not appear. To leading order, we find iX — N, 
and 



exp(^o) 



P(x) 
y/h ' 



(22) 



Notice that eq. ^2| has the correct reparametrization covariance by construction, with a function equated to a ratio of 
densities. 

At the next order, expanding eq. [ll] about fa,iX = N, we find 



1 



-d~ 



Eq. is easily solved to obtain 



-d x + JVe 00 



k = (~A h + Ne*°) fa = 



(f>x(x) 

with the leading term in K given by 



P(x') 
\/h(x') 



A h Mx') 



P{x) 
\/h{x) 



K(x,x'), 



*h<Po 



K(x,x') 



1 



2VN 



[e^(x)e*°(x')] 



-1/4 



exp 



ma>t( X , X ) ' 

V / xfhe^^dx" 

min(a;,x / ) , 



(23) 



(24) 



(25) 



Note the reparametrization invariance of eq. |24j. 

We can now insert eq.'s [2^]24| in eq. [)| and vary with respect to h to obtain h = h(P x ). This order of solving 
the saddlepoint equations is somewhat simpler than solving the equations obtained by independent variations — the 
results are, obviously, independent of this order of solution. We find 



SredVA = ~N J dx 











j dx 


P+^P 




(NPfa 




Vn _ 





In y/h - In P - 



-y/Np 



In y/h + <f>x - InP 



(26) 



with 0i = <j)\{h). Now, it is important to note that while S re d appears to include terms with powers of N that should 
be ignored at first glance, the variation of <f>\ with respect to h is 0(1) even though fa = 0(N^ 1 ^ 2 ). Thus, varying 
eq. [26| with respect to h, we find 

-NP^= (in y/h - In p) + O(VN) = 
2 Sy/h \ ' 

so to leading order in N, y/ho — P,fa — so our old result is recovered. However, at the next order, we find no 
correction to h of 0(iV -1 / 2 ). Thus the leading finite size correction is entirely encoded in eq. [24|, with y/h = P. At 
higher orders in N -1 / 2 , there are, of course, further corrections to fa and to h. 

Consider now the computation of the functional integral expanded about the saddlepoint solution found above. 
The division by the volume of the group of diffeomorphisms eliminates the integration over fluctuations in the metric, 
so we are left with computing the integral over the fluctuations in fa This integration gives the operator determinant 
det~ 1//2 (— Ah + Ne^ 1 ). This determinant can be computed in the large N limit by the standard van Vleck technique, 
explained in Ref. Q, for example. Thus 



det(-A p2 +JVe^) ^ocexpf-^ J dxP{x)e^ /2 ^j , 



(27) 
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which differs from the corresponding expression in Ref. |2j. Eq. |27] is reparametrization invariant. Putting everything 
together, eq. ^ is given by 

J[P( Xi )exp (-^L J dxP(x)e^/ 2 -I J dxPix)- 1 ^) 2 ^ . (28) 

Eq. |28| can be used to evaluate the least-square estimate of the inferred distribution. 
Finally, we compute the magnitude of the correction <f>i : 

(0l(x) 2 ) conn cctod ~ r— , (29) 

4v N 

independent of x, implying uniform convergence to the solution reviewed in Sect. II. This uniform convergence is 
of great practical importance, since it implies that our estimated distribution will make sense even in regions where 
P is small. The uniformity can be directly traced to the reparametrization invariance of our computations, which 
enables the binning size to be adjusted locally enabling a more accurate determination of the distribution. As this 
uniformity is of great importance for practical applications, it is all the more gratifying that a re-analysis motivated 
by conceptual concerns regarding reparametrization covariance leads to a computational improvement. 



IV. CONCLUSIONS 



The statistical inference problem is of great interest in biophysics, for example, when a; is a vector variable. Bialek, 
Callan and Strong 0] gave an account of a higher dimensional variant of their theory, but I want to add a few remarks 
here regarding a geometric version sketched in 0], taking into account reparametrization invariance. In Sect.'s II 
and III, we saw that a reparametrization-covariant Bayesian set-up could be obtained by introducing a metric, and a 
scalar field, dividing out by the group of diffcomorphisms. Since a metric in one dimension has only one global degree 
of freedom, this left only one local scalar degree of freedom, which is what we expect to need to fit the data, since the 
data determines a scalar density on the interval x+]. 

In higher dimensions, the aim is still to determine a scalar density, but the number of local degrees of freedom 
in the metric in d dimensions is d{d + l)/2 and reparametrizations form a d parameter local symmetry group. It is 
clear then that a geometric formulation will require a reparametrization invariant constraint to reduce the number of 
degrees of freedom in the metric, to the required d + 1 local degrees of freedom. 

Reparametrization invariant constraints must be given by the vanishing of tensor quantities, so in this case we have 
to construct appropriate tensors from the metric. Investigating the constraints provided by the vanishing of various 
curvature tensors, since we expect only a scalar local degree of freedom to survive after imposing the constraint and 
dividing out by reparametrizations, we are naturally led to consider conformally Euclidean metrics, which can indeed 
be characterized by the vanishing of a tensor constructed from the curvature, the Weyl curvature, . W vanishes 

identically in d < 3, so the geometric theory will only be valid for d > 3. 

The proposed generalization of eq. ^| to higher dimensions is now given by 



P[h a0 ] oc exp (- J d d xVhC^j S (\ - J d d x^hjx)^ J[ 6 (W aPl \x)) 



(30) 



where £ is a scalar constructed out of the metric h a p, constrained to be a metric of vanishing Weyl curvature, W ( f^ 7 , 

and \[~h = det 1 / 2 '{h a p) as usual. 

A variety of issues, some technical, have to be addressed in order to define eq. ^ precisely: (i) The integration 
measure for integrating over metrics is notoriously nonlinear; (ii) Solving the constraint W = 0, and then extracting 
the volume of the group of diffeomorphisms will leave a Jacobian; (iii) What is an appropriate choice for £, given the 
constraint W = 0? 

Before tackling the technical issues, consider the logic of what we wish to do. Metrics with vanishing Weyl curvature 
are of the form 

Kp{x) = e 2 ^(*» S Se ee JiJ^^ 8 5e . (31) 

For such metrics, \fh = det(J)e d ^. Now, following our steps in Sect. 2, we want to extremize (ignoring technicalities) 
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d d xdet(J)e d(t > £[cj)( f(x))} + iX (J d d xdet{J)e d4, -lj-N J d d xP(x) [In det J + 4>] , (32) 

where NP(x) =Y,S(x- Xi). Now, P(x) is a density, so d d xP(x) = d d fP(f), with P(f) det(J) = P(x). We can 
rewrite eq. [32] as 

S = J d d /e^£[0(/)] + iX (j d d fe d,t > - lj - N J d d xP{x) In det J - N J d d fP{f)cj) , (33) 

It follows that there are two independent functional variations, one with respect to 4>(f), and the other with respect 
to f(x), the latter giving 



d 

dx" 



= 0. (34) 



Eq. |34| is the analogue of eq. ^T], and determines the coordinates f a in terms of the coordinates x a in which the data 
is presented, and is reparametrization-invariant because P{x) is a density, not a function. Consider the special case 
x = f : eq. |34| is the requirement that the density P in the coordinates / is the constant density, exactly analogous to 
our result in one dimension, with <f> a constant. 

It is amusing to explicitly construct /" coordinates which satisfy 

df a 

det^ = P(.x). 

We construct these coordinates essentially iteratively as a one-dimensional distributions with parameters. If we 
couldfind / : 

f l (x a ) = fix 1 ), f(x<*) = f(x\x%... 

then we would get 

df a _ ^ r df a 
d6t 8x0 -ll-dx^ 

a 

so the problem is one of factoring P appropriately. Define 



« d 
P 1 {x 1 )= P(x\x 2 ,x 3 ,...)Y[dx a , 

„ d 

P 2 (x 1 ,x 2 )P 1 (x 1 ) = / P(x\x 2 ,x 3 ,...)]]_dx a , 

** o 



d-1 



P d (x a ) J] ^ = P( xU )> ( 35 ) 
/3=1 



then 



/V)= / P 1 (z)dz, 

r 2 

f 2 (x\x 2 )= / P 2 (x\z)dz, 



(36) 

is an explicit set of coordinates in which the given density P is constant. Notice that each /" varies from to 1, as 
befits integrals of one-dimensional normalized probability distributions, P a . 



G 



As explained in Ref. 's [gjj] , we are interested in extracting a continuum distribution from some set of binned data, 
as independent as possible of the binning. In renormalization group terminology, we are interested in the 'infrared' 
properties of the data, independent of the 'cutoff'. The terms most relevant in the infrared are precisely the terms 
with the fewest derivatives. Thus, since Einstein-Hilbert action is a total derivative for conformally Euclidean metrics, 
we consider C = — ^i?A~ 1 i?, where A is the Laplacian, and R is the Ricci scalar constructed from h a p. This Lagrange 
density is non-local for general metrics in dimensions larger than two, but is local for metrics with W = Q, 

f*xVhC= ^ / *«/ e^ 2 >^|^. (37) 

The steps carried out in the previous section for determining the finite N corrections can be carried out in an exactly 
analogous fashion for this action for the leading correction, using the Laplacian in d dimensions in the / coordinates 
explicitly constructed above. At higher orders in N, the nonlincarity of C leads to a distinctly different computation. 
I hope to address these issues, along with the technical issues mentioned above, elsewhere. 

In conclusion, I have computed the finite sample size corrections to the geometric solution of the statistical inference 
problem I gave in earlier work. The uniform convergence of these corrections should be important for applications so 
it is of interest to test this theoretical construction in practice. 

I am grateful to V. Balasubramanian and C. Callan for conversations. This work was supported in part by NSF 
grant PHY96-00258. 
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